Abstract

Most visual recognition methods implicitly assume the data distribution remains unchanged
from training to testing. However, in practice domain shift often exists, where
real-world factors such as lighting and sensor type change between train and test, and
classifiers do not generalise from source to target domains. It is impractical to train
separate models for all possible situations because collecting and labelling the data is
expensive. Domain adaptation algorithms aim to ameliorate domain shift, allowing a
model trained on a source to perform well on a different target domain. However, even
for the setting of unsupervised domain adaptation, where the target domain is unlabelled,
collecting data for every possible target domain is still costly. In this paper, we propose
a new domain adaptation method that has no need to access either data or labels of the
target domain when it can be described by a parametrised vector and there exist several
related source domains within the same parametric space. It greatly reduces the burden
of data collection and annotation, and our experiments show some promising results.

1 Introduction

Supervised learning usually assumes that the training and testing data are drawn from the
same underlying distribution. This assumption is easily violated in many real-world problems.
For example, a classifier might be trained with high-quality images captured by an
HD camera in an ideal environment, while the trained model is applied to images captured
in poor lighting conditions where the camera is not held still, so the images are blurred
and dark. In this scenario, directly using the pre-trained model leads to poor performance.

The pervasiveness of this domain shift issue has motivated extensive research in domain 
adaptation [DU, O, El, EB]. Domain adaptation (DA) aims to undo this distribution shift 
between the source domain (where the model is trained) and target domain (where the model 
is applied). DA has two main settings: supervised DA where the target domain has labels but 
its data volume is very small, and unsupervised DA where the target domain is completely 
unlabelled. In this paper, we will focus on unsupervised DA - and eventually a new problem 
setting of zero-shot domain adaptation - as they are more practically useful. 

In most DA studies, ‘domain’ is often equivalent to ‘dataset’ for benchmarking convenience.
In the classic Office dataset [EB], images are split into three discrete domains/datasets
based on their capture device. However, it is commonly overlooked by the DA community
that in many cases domains are not determined by a single categorical variable, but rather
a vector of continuous variables. For example, consider visual surveillance cameras for person
or event recognition: the camera angle α affects the poses of people that it captures, and
the time of day τ describes the illumination. In this case domains vary smoothly on a continuous
manifold, and [α, τ] describes a specific domain. Given that both α and τ are continuous
variables, we have, in fact, an infinite number of domains; and each parameter will have a
certain degree of variation that prevents models trained on one domain from working well
on another. Discretizing each factor allows conventional DA to be applied; but this inevitably
leads to information loss, and to a large number of domains. We refer to parameters such
as α as domain factors, and to vectors like [α, τ] as domain descriptors.

With this perspective it is clear that even unsupervised domain adaptation does not scale - 
since the number of datasets to collect grows exponentially in the number of domain factors. 
This motivates our search for a method that can learn from a few domains and then generalise 
well to an arbitrary parametrised domain without collecting the data from it. We refer to this 
scenario as zero-shot domain adaptation (ZSDA), distinct from supervised and unsupervised
domain adaptation. ZSDA is possible if we can predict some pattern (e.g., the subspace that 
supports the data) of the target domain from the abstract description of the domain descriptor. 
We formulate this task as a problem of manifold-valued data regression. 

ZSDA aims to enable the attractive use case where a model can consistently and instantaneously
perform well on a varying test data distribution, by being automatically ‘calibrated’
on the fly by domain descriptor metadata. This would have important applications in numerous
areas including object [EB], person [El] and audio [E3] recognition. To this end, our
contributions are three-fold: (i) We propose the novel problem of zero-shot domain adaptation
for domains parametrised by continuous vectors. (ii) We provide a solution to the
proposed problem by designing a novel multivariate regression model for the Grassmannian.
(iii) We show promising early results demonstrating the efficacy of our ZSDA.


2 Related Work 

2.1 Zero-Shot Learning 

Zero-Shot Learning (ZSL) has received extensive attention in the computer vision community,
for tasks such as character [113], object [H, O], and action [H] recognition. Instead of building
a map (classifier) directly from the image to label space, ZSL studies [US, E3] propose an
intermediate representation. This representation may take the form of ‘attributes’, or more
generally, a semantically meaningful descriptor. The motivation is that, assuming the mapping
from image to attributes is sufficiently universal, it can be learned from a large amount
of data [Eg, E3]. The mapping can then be used to build recognisers on-the-fly by giving
the semantic descriptor for a new object, for example by assigning the attributes [‘black’,
‘white’, ‘stripes’] to the new object ‘zebra’.

Inspired by ZSL, we aim to achieve a similar on-the-fly capability for domain adaptation. 
In the same way that semantic representations such as attributes enable ZSL, we aim to 
exploit the often freely available domain-metadata to adapt a trained model to any target 
domain given only its descriptor. The closest work in this area is [E3], which first mentioned 
the problem of zero-shot domain adaptation. However it has the drawback that the domain 
descriptor is limited to a vector of categorical factors only. We do not have this restriction. 

2.2 Domain Adaptation

Domain Adaptation (DA) techniques reduce the divergence between source and target domains
such that a model trained on the source performs well on the target. We focus on
unsupervised domain adaptation here, as it is more practically valuable. However, it is challenging
because we have zero knowledge of the conditional distribution of the target domain
$P(Y \mid X_T)$, thus a discriminative model trained on the source domain $P(Y \mid X_S)$ cannot be
leveraged. The remaining option is to exploit the marginal distributions $P(X_T)$ and $P(X_S)$.

There are two main approaches in this area: data-centric and subspace-centric. Data: This
approach seeks a unified transformation $\phi(\cdot)$ that projects two domains’ data into a new
space that reduces the discrepancy between the transformed target data $\phi(X_T)$ and source data
$\phi(X_S)$. A typical pipeline is to perform PCA [O] or sparse coding [ED] on the union of the
domains with an additional objective that minimises the maximum mean discrepancy (MMD)
of the new representations of two domains in a reproducing kernel Hilbert space (RKHS)
$\mathcal{H}$, i.e., $\| E[\phi(X_T)] - E[\phi(X_S)] \|_{\mathcal{H}}^2$, where $E[\cdot]$ is the expectation operator. Subspace: This
approach does not manipulate the data directly; instead it makes use of the subspaces of the
two domains. Here, subspace refers to a D-by-K matrix of the first K eigenvectors induced by
PCA on the original D-dimensional data. We denote $P_S$ and $P_T$ as the subspaces of the source
and target domain learned by two separate PCAs. Subspace Alignment (SA) [B] learns a
linear map $M$ for $P_S$ that minimises the Bregman matrix divergence $\| P_S M - P_T \|_F^2$. [O]
samples several intermediate subspaces $P_1, P_2, \dots, P_N$ from $P_S$ to $P_T$. That is achieved by
thinking of $P_S$ and $P_T$ as two points on the Grassmann manifold (Grassmannian) $G(K, D)$
and finding a geodesic (shortest path on the manifold) between them; the points along the
geodesic are then meaningful subspaces. All subspaces are concatenated to form a richer
linear operator $[P_S, P_1, P_2, \dots, P_N, P_T]$ that projects two domains into a common space, where
the source classifier generalises better to the target domain. A weakness of [O] is that the
number of intermediate points is a hard-to-determine hyper-parameter. As an elegant solution to
this, [DU] samples all the intermediate points. Although this produces infinitely long feature
vectors, which rules out conventional linear classifiers, their dot-product is still defined, and thus
any kernelised classifier can be used. A recent study [113] considered the case where domains
are associated with a single continuous variable (time) using a sequential PCA and subspace-based
DA method. However, it does not extend to a vector domain descriptor, nor to ZSDA.

Reviewing these DA studies, one could easily conclude that target data are compulsory.
It initially seems to be impossible to achieve DA without any data in the target domain. However,
if we take a deeper look at the subspace approach, the key is the subspace rather than
the data. If each observed subspace (domain) is associated with an independent vector variable
$z$ (domain descriptor), it is possible to predict a new subspace $P^*$ given its corresponding
$z^*$ via a regression model $z \to P$. This is sometimes called a manifold-valued data regression
problem, where the output space is a manifold (e.g., the Grassmannian) and the input is in
Euclidean space.


2.3 Manifold-valued Data Regression 

There exist several studies addressing regression in the setting where the independent variable
is a point in Euclidean space and the dependent variable is a point in a non-flat manifold
space such as Riemann and Grassmann manifolds. Based on their methodology, we can
group these studies into three categories: (i) Parametric approaches like [□, □, 0, O, O, E3]
usually try to find a formulation for the geodesic and then provide a numerical solution for
its estimation. (ii) Semi-parametric approaches, e.g., [IZ3], use a link function to map from
Euclidean to Riemannian space. (iii) Non-parametric approaches such as [0, 0] adapt kernel
regression to the manifold case by observing that they are all essentially about searching for
a point for which the sum of its (reweighted) distances to all training points is minimised.

Note that for most parametric solutions, the independent variable is assumed to be a
scalar (univariate regression). This is because (i) in applications where these methods are
popular, e.g., medical imaging, one usually wants to find a pattern against a single factor
(e.g., age), and (ii) it is technically challenging to extend the method to the multivariate case
because the prediction no longer corresponds to a single geodesic curve, which makes
the gradient derivation problematic.

We therefore aim to find a solution based on a non-parametric method since the kernel 
function usually does not make assumptions on whether the input is a scalar or a vector. 


3 Methodology 

3.1 Kernel Regression on Grassmannian 

Our goal is to build a regression model that takes an M-dimensional vector of independent
variables as the input and predicts a point on the Grassmannian (represented by a matrix with
orthonormal columns). The output constraint means it cannot be treated as conventional
Euclidean regression. Therefore, we design a regression model for the Grassmannian.


3.1.1 Kernel Regression Review

We first review kernel regression. Assume we are given a set of (data, label) pairs,

$$\{(z_1, P_1), (z_2, P_2), \dots, (z_N, P_N)\}, \tag{1}$$

where $z \in \mathbb{R}^M$ and $P \in \mathbb{R}$, and a kernel function $k(z_1, z_2)$ that measures the similarity between
$z_1$ and $z_2$. The kernel regression prediction for a test point $z^*$ is then estimated by

$$P^* = \frac{\sum_{i=1}^{N} k(z^*, z_i) P_i}{\sum_{i=1}^{N} k(z^*, z_i)}. \tag{2}$$
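As a concrete illustration of Eq. 2, the following minimal sketch implements scalar kernel regression in NumPy. The RBF kernel form, the bandwidth values, the function names and the toy data are assumptions made for illustration only; they are not taken from the paper.

```python
import numpy as np

def rbf_kernel(z_a, z_b, sigma=0.1):
    """RBF similarity between two descriptor vectors (assumed kernel form)."""
    diff = np.asarray(z_a, dtype=float) - np.asarray(z_b, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))

def kernel_regression(z_star, Z, P, sigma=0.1):
    """Eq. 2: kernel-weighted average of the scalar training labels P_i."""
    weights = np.array([rbf_kernel(z_star, z_i, sigma) for z_i in Z])
    return float(np.dot(weights, P) / weights.sum())

# Toy data: four descriptors on the corners of the unit square.
Z = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
P = np.array([1.0, 2.0, 3.0, 4.0])

# The centre is equidistant from all corners, so the prediction is the plain mean.
prediction = kernel_regression([0.5, 0.5], Z, P, sigma=0.5)
```

Because the kernel only needs a distance between descriptors, the same code works unchanged for scalar or vector inputs, which is the property the non-parametric approach relies on.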


3.1.2 From Euclidean to Grassmannian regression 

When $P \in \mathcal{M}$, where $\mathcal{M}$ is a non-flat manifold and $P$ is no longer a scalar, Eq. 2 can be invalid.
For example, suppose $\mathcal{M}$ is a Grassmannian $G(K, D)$, so its members are now matrices $P \in \mathbb{R}^{D \times K}$
with the constraint that $P^T P = I_K$. Eq. 2 can be applied and the matrices $P_i$ added, but this
is meaningless because adding two points on the Grassmann manifold does not necessarily
give another point on the Grassmann manifold.

Inspired by [□], we propose to think of kernel regression as the solution of the following
optimisation problem:

$$\operatorname*{argmin}_{P \in \mathbb{R}} \sum_{i=1}^{N} w_i (P - P_i)^2, \tag{3}$$

where $w_i = \frac{k(z^*, z_i)}{\sum_{j=1}^{N} k(z^*, z_j)}$. More generally, we have

$$\operatorname*{argmin}_{P} \sum_{i=1}^{N} w_i \, d^2(P, P_i), \tag{4}$$

where $d(\cdot, \cdot)$ is a metric (distance function). $P$ is the Fréchet mean if the minimizer is
unique (or a Karcher mean when it is a local minimum). Note that the Fréchet mean is defined
in a general metric space, thus it provides a way to work with manifold-valued data as long as
we can find a well-defined distance function for the points on the manifold.
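
To see why the least-squares problem of Eq. 3 reproduces the kernel regression estimate of Eq. 2, one can set the derivative of the objective to zero. The short derivation below is a sketch that assumes normalised kernel weights, $w_i = k(z^*, z_i) / \sum_j k(z^*, z_j)$, so that $\sum_i w_i = 1$:

$$
\begin{aligned}
\frac{\partial}{\partial P} \sum_{i=1}^{N} w_i (P - P_i)^2 &= 2 \sum_{i=1}^{N} w_i (P - P_i) = 0 \\
\Rightarrow \quad P &= \frac{\sum_{i=1}^{N} w_i P_i}{\sum_{i=1}^{N} w_i} = \sum_{i=1}^{N} w_i P_i = \frac{\sum_{i=1}^{N} k(z^*, z_i) P_i}{\sum_{i=1}^{N} k(z^*, z_i)}.
\end{aligned}
$$

Eq. 4 keeps the same weights and only swaps the squared Euclidean difference for a squared distance on the manifold.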

Grassmann Manifold Background We first review some basic concepts about the Grassmannian.
Many distances on the Grassmannian are defined based on a key concept called
‘principal angle’, which can be calculated by SVD. E.g., for two points $P_1$ and $P_2$ on $G(K, D)$,

$$P_1^T P_2 = U S V^T, \tag{5}$$

where $S = \operatorname{diag}(\cos(\theta_1), \cos(\theta_2), \dots, \cos(\theta_K))$. The angle $\theta_k = \cos^{-1}(S_{k,k})$ is the $k$th principal
angle. Table 1 lists some frequently used distance functions on the Grassmannian. The
details on deriving these functions and their comparison are beyond the scope of this paper.
Interested readers are referred to [O] and [E3].

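Table 1: Distances $d^2(P_1, P_2)$ on $G(K, D)$ in terms of principal angles and orthonormal bases.

Distance                 Principal angles                                 Orthonormal bases
Binet-Cauchy distance    $1 - \prod_{k=1}^{K} \cos^2 \theta_k$            $1 - (\det(P_1^T P_2))^2$
Chordal distance         $\sum_{k=1}^{K} \sin^2 \theta_k$                 $\frac{1}{2} \| P_1 P_1^T - P_2 P_2^T \|_F^2$
Martin distance          $\log \prod_{k=1}^{K} (\cos^2 \theta_k)^{-1}$    $-2 \log(\det(P_1^T P_2))$
Procrustes distance      $4 \sum_{k=1}^{K} \sin^2(\theta_k / 2)$          $\| P_1 U - P_2 V \|_F^2$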
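
The relationship between the two columns of Table 1 can be checked numerically. The sketch below computes the principal angles via the SVD of Eq. 5 and evaluates each distance in both forms on two random subspaces. The helper names, the dimensions (D = 10, K = 3) and the random test data are illustrative assumptions; the absolute value inside the Martin distance is a guard added here because the determinant can be negative for an arbitrary choice of bases.

```python
import numpy as np

def random_subspace(D, K, rng):
    """Orthonormal D x K basis of a random K-dimensional subspace (via QR)."""
    Q, _ = np.linalg.qr(rng.standard_normal((D, K)))
    return Q

def principal_angles(P1, P2):
    """Eq. 5: SVD of P1^T P2; the singular values are cos(theta_k)."""
    U, s, Vt = np.linalg.svd(P1.T @ P2)
    theta = np.arccos(np.clip(s, -1.0, 1.0))
    return theta, U, Vt.T

def distances_from_angles(theta):
    cos2 = np.cos(theta) ** 2
    return {
        "binet_cauchy": 1.0 - np.prod(cos2),
        "chordal": np.sum(np.sin(theta) ** 2),
        "martin": np.log(np.prod(1.0 / cos2)),
        "procrustes": 4.0 * np.sum(np.sin(theta / 2.0) ** 2),
    }

def distances_from_bases(P1, P2, U, V):
    det = np.linalg.det(P1.T @ P2)
    return {
        "binet_cauchy": 1.0 - det ** 2,
        "chordal": 0.5 * np.linalg.norm(P1 @ P1.T - P2 @ P2.T, "fro") ** 2,
        "martin": -2.0 * np.log(abs(det)),  # abs() guards against a negative determinant
        "procrustes": np.linalg.norm(P1 @ U - P2 @ V, "fro") ** 2,
    }

rng = np.random.default_rng(0)
P1, P2 = random_subspace(10, 3, rng), random_subspace(10, 3, rng)
theta, U, V = principal_angles(P1, P2)
by_angles = distances_from_angles(theta)
by_bases = distances_from_bases(P1, P2, U, V)
```

Both dictionaries should agree up to floating-point error, which is a convenient sanity check when implementing any of these distances.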


Manifold-valued data regression with vector input For our manifold-valued data regression
task, we choose the Binet-Cauchy (BC) distance, because of its favourable sensitivity properties
[O], and because it is amenable to deriving gradients. Substituting the BC distance
into Eq. 4, we obtain the following objective function to optimise:

$$\operatorname*{argmin}_{P} \; 1 - \sum_{i=1}^{N} w_i (\det(P^T P_i))^2, \tag{6}$$

which is subject to the constraint $P^T P = I_K$. The gradient with respect to $P$ is

$$\nabla_P = \sum_{i=1}^{N} -2 w_i (\det(P^T P_i))^2 P_i (P^T P_i)^{-1}. \tag{7}$$
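
A finite-difference check is a simple way to validate a gradient of this kind. The sketch below implements the Binet-Cauchy objective and its unconstrained (Euclidean) gradient, written here as $\sum_i -2 w_i (\det(P^T P_i))^2 P_i (P^T P_i)^{-1}$, and compares it against central differences. The function names, dimensions, weights and random data are assumptions for illustration.

```python
import numpy as np

def bc_objective(P, subspaces, weights):
    """Eq. 6: 1 - sum_i w_i det(P^T P_i)^2."""
    return 1.0 - sum(w * np.linalg.det(P.T @ Pi) ** 2
                     for w, Pi in zip(weights, subspaces))

def bc_gradient(P, subspaces, weights):
    """Euclidean gradient of the objective with respect to P."""
    G = np.zeros_like(P)
    for w, Pi in zip(weights, subspaces):
        A = P.T @ Pi
        G -= 2.0 * w * np.linalg.det(A) ** 2 * Pi @ np.linalg.inv(A)
    return G

def numerical_gradient(P, subspaces, weights, eps=1e-6):
    """Central finite differences, one matrix entry at a time."""
    G = np.zeros_like(P)
    for idx in np.ndindex(*P.shape):
        step = np.zeros_like(P)
        step[idx] = eps
        G[idx] = (bc_objective(P + step, subspaces, weights)
                  - bc_objective(P - step, subspaces, weights)) / (2.0 * eps)
    return G

rng = np.random.default_rng(0)
subspaces = [np.linalg.qr(rng.standard_normal((6, 2)))[0] for _ in range(3)]
weights = np.array([0.5, 0.3, 0.2])
P = np.linalg.qr(rng.standard_normal((6, 2)))[0]

max_error = np.max(np.abs(bc_gradient(P, subspaces, weights)
                          - numerical_gradient(P, subspaces, weights)))
```

Note that this checks only the unconstrained gradient; the orthogonality constraint is handled separately by the update rule.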

Vanilla gradient descent is not applicable because of the orthogonality constraints. It is a
non-trivial optimisation problem as the constraints lead to non-convexity. A simple solution
is to do gradient descent as usual and re-orthogonalise the matrix after each step, but it is
numerically very expensive. There are some studies on this topic, such as [ED], [IZ3], and
[EE]. We adopt the solution from [EZD]. It applies a Crank-Nicolson-like update scheme that
preserves the constraints.

Given a feasible point $P$ and the gradient $G = \nabla_P$, a skew-symmetric matrix $A$ is defined as

$$A := G P^T - P G^T. \tag{8}$$

The new trial point is determined by the Crank-Nicolson-like scheme,

$$P^{\mathrm{update}}(\eta) = P - \frac{\eta}{2} A \left( P + P^{\mathrm{update}}(\eta) \right), \tag{9}$$

where $\eta$ is the learning rate, and is given by the closed form

$$P^{\mathrm{update}}(\eta) = Q P, \quad \text{where } Q = \left( I + \frac{\eta}{2} A \right)^{-1} \left( I - \frac{\eta}{2} A \right). \tag{10}$$

The details of deriving equations (8) - (10) can be found in [BID].
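
The update scheme above can be sketched in a few lines of NumPy. The code below minimises the Binet-Cauchy objective with the closed-form update $P \leftarrow QP$, $Q = (I + \frac{\eta}{2}A)^{-1}(I - \frac{\eta}{2}A)$. The initialisation at the highest-weighted source subspace, the step-halving rule, the iteration count and the toy data are assumptions added for this sketch; the paper does not specify them.

```python
import numpy as np

def bc_objective(P, subspaces, weights):
    return 1.0 - sum(w * np.linalg.det(P.T @ Pi) ** 2
                     for w, Pi in zip(weights, subspaces))

def bc_gradient(P, subspaces, weights):
    G = np.zeros_like(P)
    for w, Pi in zip(weights, subspaces):
        A = P.T @ Pi
        G -= 2.0 * w * np.linalg.det(A) ** 2 * Pi @ np.linalg.inv(A)
    return G

def cayley_step(P, G, eta):
    """Closed-form update Q P with A = G P^T - P G^T (skew-symmetric)."""
    I = np.eye(P.shape[0])
    A = G @ P.T - P @ G.T
    # Solve (I + eta/2 A) P_new = (I - eta/2 A) P instead of forming the inverse.
    return np.linalg.solve(I + 0.5 * eta * A, (I - 0.5 * eta * A) @ P)

def predict_subspace(subspaces, weights, eta=0.5, n_iter=100):
    # Assumption: start from the source subspace with the largest weight.
    P = subspaces[int(np.argmax(weights))].copy()
    f = bc_objective(P, subspaces, weights)
    for _ in range(n_iter):
        P_new = cayley_step(P, bc_gradient(P, subspaces, weights), eta)
        f_new = bc_objective(P_new, subspaces, weights)
        if f_new < f:
            P, f = P_new, f_new
        else:
            eta *= 0.5  # assumption: halve the step when the objective does not improve
    return P

rng = np.random.default_rng(0)
base = rng.standard_normal((8, 2))
subspaces = [np.linalg.qr(base + 0.3 * rng.standard_normal((8, 2)))[0] for _ in range(3)]
weights = np.array([0.5, 0.3, 0.2])
P_init = subspaces[0]
P_star = predict_subspace(subspaces, weights)
```

Since $Q$ is orthogonal whenever $A$ is skew-symmetric, every iterate keeps orthonormal columns without an explicit re-orthogonalisation step. The matrix $A$ is D-by-D, so for large D a practical implementation would likely exploit its low-rank structure rather than solve the full system as done here.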


3.2 Zero-Shot Domain Adaptation 

Using the methodology developed in Sec. 3.1, our ultimate goal of zero-shot domain adaptation
becomes possible. Assume that we have N observed domains given by

$$\{(X_1, [y_1], z_1), (X_2, [y_2], z_2), \dots, (X_N, [y_N], z_N)\}, \tag{11}$$

where $X_i$ is the feature matrix, from which we can learn a subspace $P_i$ by PCA. $z_i$ and $y_i$ are
the domain descriptor and label vector respectively for domain $i$. $[y_i]$ indicates that we do
not assume all observed domains have been labelled.

For an unseen domain with descriptor $z^*$, we can predict its subspace $P^*$ based on the
proposed method (Eq. 6) and the training data

$$\{(z_1, P_1), (z_2, P_2), \dots, (z_N, P_N)\}. \tag{12}$$

Once $P^*$ is obtained, any subspace-based DA method (e.g., [Q, DU, O]) can be applied to align
the unseen (target) domain to any labelled source domain where a classifier was trained.
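
The last step can be illustrated with Subspace Alignment, which has a short closed form ($M = P_S^T P_T$). This is a stand-in chosen for brevity; the experiments in this paper use GFK instead. In the sketch below the function names and toy data are hypothetical, and a random orthonormal matrix takes the place of the predicted subspace $P^*$.

```python
import numpy as np

def pca_subspace(X, K):
    """First K principal directions of X (n_samples x D) as a D x K matrix."""
    X_centred = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_centred, full_matrices=False)
    return Vt[:K].T

def aligned_source_basis(P_source, P_target):
    """Subspace Alignment: P_S M with M = P_S^T P_T."""
    return P_source @ (P_source.T @ P_target)

rng = np.random.default_rng(0)
X_source = rng.standard_normal((50, 6))            # toy labelled source features
P_source = pca_subspace(X_source, K=3)

# Stand-in for the subspace predicted from the target descriptor alone.
P_star = np.linalg.qr(rng.standard_normal((6, 3)))[0]

# Train the classifier on the aligned source features ...
source_features = X_source @ aligned_source_basis(P_source, P_star)

# ... and apply it to target samples projected onto the predicted subspace.
X_target_test = rng.standard_normal((5, 6))        # target data seen only at test time
target_features = X_target_test @ P_star
```

No target data are touched until test time: the only target-side quantity used during adaptation is the predicted subspace.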


4 Experiments 

Dataset: To test our algorithm, we need a dataset which exhibits a range of continuously
parametrised domains (Sec. 3.1). However, most existing DA and more general vision
datasets are grouped into discrete domains/datasets. We therefore alter an existing dataset
for our purposes. We use the Office Dataset [ED], which collects images of office supplies
from three sources: Amazon, webcam and DSLR. It is a classic dataset for evaluating domain
adaptation algorithms. The typical experimental design for this dataset is to evaluate recognition
performance when a model is trained on one domain (e.g., Amazon) and tested on
another (e.g., webcam). To test our algorithm, we create a new dataset based on Office.

Settings: We use all the Amazon images, which cover 31 categories with an average
of 90 images each. Then, we simulate a range of continuously parametrised domains by
degrading each image by two means: lowering the resolution and reducing the brightness.
Specifically, we apply a Gaussian filter to simulate low resolution and divide every pixel
value by a factor to simulate poor lighting. The size of the Gaussian filter and the darkening
factor provide the two factors of the domain descriptor. In this experiment, we generate nine
distinct domains from three levels of degradation for each parameter, as shown in Table 2. An
example of each domain along with the original image is shown in Fig. 1. For each domain,
we split the training and testing set equally, and the assignment of train versus test set is
consistent for all domains. This guarantees that when an image appears in the training set, its
other degraded versions will not appear in the testing set.

Figure 1: Example images in category Backpack. (Left) Original. (Right) Nine simulated
domains.

Table 2: Nine domains generated by the degradation of Office/Amazon.

Domain Index          1    2    3    4    5    6    7    8    9
Gaussian Filter Size  5    5    5    10   10   10   15   15   15
Brightness Factor     1.5  2    3    1.5  2    3    1.5  2    3
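
The degradation procedure described in the Settings paragraph can be sketched as follows for a single greyscale image. The Gaussian standard deviation (set here to one third of the filter size), the edge padding and the function names are assumptions; the paper specifies only the filter size and the brightness divisor.

```python
import numpy as np

def gaussian_kernel_1d(size, sigma):
    """Normalised 1-D Gaussian kernel with the given number of taps."""
    x = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def blur(image, size, sigma):
    """Separable Gaussian blur of a 2-D image with edge padding."""
    k = gaussian_kernel_1d(size, sigma)
    left, right = size // 2, size - 1 - size // 2
    out = image.astype(float)
    for axis in (0, 1):
        pad = [(0, 0), (0, 0)]
        pad[axis] = (left, right)
        padded = np.pad(out, pad, mode="edge")
        out = np.apply_along_axis(
            lambda v: np.convolve(v, k, mode="valid"), axis, padded)
    return out

def degrade(image, filter_size, brightness_factor, sigma=None):
    """Blur to simulate low resolution, then divide pixels to simulate poor lighting."""
    sigma = filter_size / 3.0 if sigma is None else sigma  # assumed default
    return blur(image, filter_size, sigma) / brightness_factor

# Domain descriptors in the order of Table 2 (domain 1 first).
descriptors = [(size, factor) for size in (5, 10, 15) for factor in (1.5, 2.0, 3.0)]

image = np.full((32, 32), 120.0)            # toy constant image
domain_5 = degrade(image, *descriptors[4])  # filter size 10, brightness factor 2
```

On a constant image the blur has no effect, so the output is simply the input divided by the brightness factor.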


Features: We use the state-of-the-art Convolutional Neural Network (CNN) model VGG-full
[ID] as the feature extractor. Each image is first preprocessed: it is rescaled to 224 x 224 and
mean-subtracted. It is then fed into the CNN, and the activations of the penultimate layer (4096
neurons) are used as the feature vector for further experiments.

4.1 Demonstrating the Domain Shift Challenge 

In the first experiment, we demonstrate how domain shift affects recognition performance,
and to what degree domain adaptation methods can alleviate it. A linear SVM classifier is
trained on the original domain’s training data and we evaluate the performance of the classifier,
measured by accuracy, on the nine degraded domains’
test data. Then we apply a popular domain adaptation algorithm - Geodesic Flow Kernel
(GFK) [DU] - that aims to manipulate the target subspace so that domain shift is reduced.

As we can see in Fig. 2, the CNN feature is very discriminative. The accuracy on the original
domain’s test data is 81.68% (blue horizontal line in Fig. 2). However, the performance
inevitably drops when the quality of images gets lower. For the last domain, domain_9, the
accuracy has declined to 24.29%. Nevertheless, unsupervised domain adaptation by GFK
does reduce the performance drop. It improves the recognition rate by 7.42% on average.

From these results we conclude that: (i) state of the art features do not eliminate the
domain-shift problem (contrary to some claims [ 0 ]), and (ii) domain adaptation methods still
play an important role in the era of deep learning.

Figure 2: Recognition rate in Office/Amazon (horizontal axis: Domain ID, 1-9). The domain shift
problem is significant despite the use of state of the art CNN features.

Table 3: Recognition accuracy on Domain 5 after zero-shot domain adaptation.

Source ID   1      2      3      4      6      7      8      9      Avg.
No DA       66.12  67.33  66.69  71.73  69.74  69.82  68.68  62.93  67.88
ZSDA^GFK    67.90  69.03  69.03  74.43  71.80  72.66  71.09  66.41  70.29

‘Source ID’ indicates which one of the source domains is used to train the classifier. ‘No DA’ means applying the trained model
directly to the target domain. ‘ZSDA^GFK’ is the proposed method: the subspace of the target domain is predicted from all source
domains, and the domain adaptation is conducted by GFK from each source domain in turn to the target domain.

4.2 Zero-Shot Domain Adaptation 

In the second experiment, we validate the proposed algorithm’s ability to undo domain shift 
given only a target domain descriptor. 

Target domain performance for different sources of labels: The fifth domain, domain_5,
is chosen to be the target domain, and the evaluation is run on its testing data. Unlike traditional
domain adaptation approaches [Dll], none of the data (either feature or label) in the target
domain is given. The only known information about the target domain is the descriptor of
domain_5, i.e., $z_5 = [10, 2]$. Besides this, we assume all the training data from the other
domains as well as their domain descriptors are given, from which we can learn 8 subspaces
via PCA (the reduced dimension is $K = 512$). Then the subspace of domain_5 is predicted
by the proposed algorithm so that the subspace-based domain adaptation [DU] can be applied.
Note that we rescale each factor in $z$ to [0, 1] to reduce the effect of different ranges of factors,
and an RBF kernel with $\sigma = 0.1$ is chosen to be $k(\cdot, \cdot)$.
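
To make the rescaling and kernel weighting concrete, the sketch below computes regression weights for domain_5 from the descriptors in Table 2. Two details are assumptions, since the text does not pin them down: the min-max rescaling is taken over all nine descriptors, and the RBF kernel is written as $\exp(-\|z - z'\|^2 / (2\sigma^2))$.

```python
import numpy as np

# Descriptors [filter size, brightness factor] for domains 1..9 (Table 2).
descriptors = np.array([[s, b] for s in (5, 10, 15) for b in (1.5, 2.0, 3.0)])

# Rescale each factor to [0, 1] (assumed: min-max over all nine descriptors).
z_min, z_max = descriptors.min(axis=0), descriptors.max(axis=0)
scaled = (descriptors - z_min) / (z_max - z_min)

target = 4                                   # zero-based index of domain_5
sources = [i for i in range(9) if i != target]

def rbf(z_a, z_b, sigma=0.1):
    """Assumed RBF form: exp(-||z_a - z_b||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((z_a - z_b) ** 2) / (2.0 * sigma ** 2))

kernel_values = np.array([rbf(scaled[target], scaled[i]) for i in sources])
weights = kernel_values / kernel_values.sum()

# One-based ID of the source domain that receives the largest weight.
nearest = sources[int(np.argmax(weights))] + 1
```

Under these assumptions the rescaled target descriptor is [0.5, 1/3], and with such a narrow bandwidth the closest source (domain 4) receives most of the weight. A different kernel parameterisation or rescaling would change the weights, so the values here are indicative only.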

The results are summarised in Table 3. These are generated by predicting the subspace $P_5$
of the unseen domain (domain_5) based on all the remaining eight source domains and testing
the classifier (with or without DA) trained from the labelled data of each source domain in
turn (i.e., single source domain to target). The source ID indicates which source domain is
used to train the classifier and optionally apply DA. Note the distinction between the source of
unlabelled data for subspace learning, and the source of labelled data for classifier learning.

Table 4: Average accuracy on every target domain after zero-shot domain adaptation.

Target ID   1      2      3      4      5      6      7      8      9
No DA       69.10  69.15  68.31  68.18  67.88  63.23  59.14  56.86  46.41
ZSDA^GFK    72.06  72.07  71.24  71.07  70.29  65.94  61.37  59.54  50.76

In every case, the proposed ZSDA method generates a useful subspace which allows
domain adaptation to improve the performance. As expected, using ‘closer’ domains
(measured by distance between z’s) such as domain_2, _4, _6, and _8 to learn the classifier
leads to better performance compared to source domains that are ‘far away’ from the
target. If we focus on the nearby domains_{2,4,6,8}, then the
average accuracy in Table 3 is 71.59%. Given that the accuracy obtained within domain_5
only (i.e. training and testing on domain_5) is 72.94%, and the average accuracy of those
sources without DA is 69.37%, this is a very encouraging result: we have reduced the cross-domain
performance drop by over half without relying on any target domain data.
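
The figures quoted in this paragraph can be reproduced from Table 3 with a few lines of arithmetic. The snippet below is only a check of the reported numbers; the variable names are illustrative.

```python
# Table 3: source ID -> (accuracy without DA, accuracy with the proposed method), in %.
table3 = {
    1: (66.12, 67.90), 2: (67.33, 69.03), 3: (66.69, 69.03), 4: (71.73, 74.43),
    6: (69.74, 71.80), 7: (69.82, 72.66), 8: (68.68, 71.09), 9: (62.93, 66.41),
}
nearby = [2, 4, 6, 8]

no_da = sum(table3[s][0] for s in nearby) / len(nearby)   # about 69.37
zsda = sum(table3[s][1] for s in nearby) / len(nearby)    # about 71.59
within_domain = 72.94                                     # train and test on domain_5

# Fraction of the cross-domain performance drop that is recovered.
gap_closed = (zsda - no_da) / (within_domain - no_da)
```

The recovered fraction comes out at roughly 0.62, consistent with the statement that the drop is reduced by over half.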

Performance for each target domain: The previous analysis fixed domain_5 as the target. 
We next extend this analysis and consider each domain in turn as the target. In each case 
we use the proposed algorithm to infer the subspace of the target domain given the eight 
sources, and we evaluate the recognition accuracies with and without DA. For conciseness, 
we just report the average accuracy over all possible label sources for each target (cf. the last 
column in Table 3). Table 4 summarises the result and it demonstrates the effectiveness and 
robustness of the proposed method. It shows performance improvements on all choices of
target domain, and on average it boosts the accuracy by 4.75% in relative terms (about 2.9
percentage points in absolute accuracy).
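
The 4.75% average can be recovered from Table 4 as the mean of the per-target relative gains; the mean absolute gain is smaller. The snippet below is a check of that reading of the table, with illustrative variable names.

```python
# Table 4: average accuracy (%) per target domain, targets 1..9.
no_da = [69.10, 69.15, 68.31, 68.18, 67.88, 63.23, 59.14, 56.86, 46.41]
zsda = [72.06, 72.07, 71.24, 71.07, 70.29, 65.94, 61.37, 59.54, 50.76]

# Mean absolute gain, in percentage points.
absolute_gains = [z - n for z, n in zip(zsda, no_da)]
mean_absolute = sum(absolute_gains) / len(absolute_gains)

# Mean relative gain, in percent of the No DA accuracy.
relative_gains = [100.0 * (z - n) / n for z, n in zip(zsda, no_da)]
mean_relative = sum(relative_gains) / len(relative_gains)
```

The relative figure is pulled upwards by the hardest target (domain 9), where the baseline accuracy is lowest and the absolute gain is largest.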


5 Conclusion 

We proposed the problem of continuously-parametrised zero-shot domain adaptation and 
developed a solution based on manifold-valued data regression. This allows us to predict the 
subspace to use at test-time and thus align a source classifier to a test-domain in advance of 
seeing any data. Preliminary results demonstrate the value of our approach. This approach 
is highly promising for its potential impact on a variety of areas where it would be useful to 
be able to ‘calibrate’ a recognition model on the fly based on metadata. 

There are numerous areas for future work. A more thorough evaluation that also covers a
wider range of applications is clearly needed. Our current kernel regression-based method is
weaker at domain extrapolation than at interpolation. Extrapolation is important for some kinds of
subspace prediction, e.g., predicting future subspaces when one domain factor is time. Thus
a generalisation for extrapolation is of interest. While the proposed method can and must use
a set of available source domains to learn the subspace regressor, it can only exploit a single
source domain’s labels. A useful extension would therefore be to exploit multiple source
domains’ worth of labels. It would also be interesting to see if the predicted subspace can act
as a regulariser so that it still helps when the target data are available but limited. Finally, a
key assumption of this paper (in common with most other domain adaptation work) is that
the domain descriptor is always observed and accurate. Relaxing this assumption to enable
the model to deal with missing or noisy descriptors is also an interesting direction.